Image
Catalog Queries with TAP (Table Access Protocol)
Contact authors: Leanne Guy
Last verified to run: 2024-01-31
LSST Science Pipelines version: Weekly 2024_04
Container Size: medium
Targeted learning level: beginner

Description: Explore the DP0.2 catalogs via TAP and execute complex queries to retrieve data.

Skills: Use the TAP service. Query catalog data with ADQL. Visualize retrieved datasets.

LSST Data Products: Object, ForcedSource, CcdVisit tables.

Packages: lsst.rsp, bokeh, pandas

Credit: Originally developed by Leanne Guy in the context of the Rubin DP0.1.

Get Support: Find DP0-related documentation and resources at dp0-2.lsst.io. Questions are welcome as new topics in the Support - Data Preview 0 Category of the Rubin Community Forum. Rubin staff will respond to all questions posted there.

1. Introduction¶

This notebook provides an intermediate-level demonstration of how to use the Table Access Protocol (TAP) server and ADQL (Astronomy Data Query Language) to query and retrieve data from the DP0.2 catalogs.

TAP provides standardized access to catalog data for discovery, search, and retrieval. Full documentation for TAP is provided by the International Virtual Observatory Alliance (IVOA). ADQL is similar to SQL (Structured Query Langage). The documentation for ADQL includes more information about syntax and keywords.

Recommendation: review the ADQL recipes and other advice to optimize TAP queries provided in the documentation for the DP0-era RSP's Data Access and Analysis Tools.

Warning: Not all ADQL functionality is supported yet in the DP0 RSP.

1.1. Package imports¶

Import general python packages, the Rubin TAP service utilities, and bokeh and holoviews for interactive visualization.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas
from pandas.testing import assert_frame_equal
from astropy import units as u
from astropy.coordinates import SkyCoord
from lsst.rsp import get_tap_service, retrieve_query
import bokeh
from bokeh.io import output_file, output_notebook, show
from bokeh.layouts import gridplot
from bokeh.models import ColumnDataSource, CDSView, GroupFilter, HoverTool
from bokeh.plotting import figure
from bokeh.transform import factor_cmap
import holoviews as hv

1.2. Define functions and parameters¶

Set the maximum number of rows to display from pandas, and configure bokeh to generate output in notebook cells when show() is called.

In [2]:
pandas.set_option('display.max_rows', 20)
output_notebook()
Loading BokehJS ...

In general, the order of results from database queries cannot be assumed to be the same every time. This function sorts the data so that we can compare the results dataframes even if the records are not in the same order from the query. Since it is sorting, it also needs to reset the incremental index with set_index.

In [3]:
def sort_dataframe(df, sort_key='objectId'):
    df = df.sort_values(sort_key)
    df.set_index(np.array(range(len(df))), inplace=True)
    return df

2. Explore the DP0.2 schema¶

2.1. Create the Rubin TAP Service client¶

Get an instance of the TAP service, and assert that it exists.

In [4]:
service = get_tap_service("tap")
assert service is not None

2.2. Schema discovery¶

To find out what schemas, tables and columns exist, query the Rubin TAP schema.

This information is also available in the "Data Products Definitions" section of the DP0.2 documentation.

Create the query to find out what schemas are in the Rubin tap_schema, execute the query, and see that a TAP Results object is returned.

In [5]:
query = "SELECT * FROM tap_schema.schemas"
results = service.search(query)
print(type(results))
<class 'pyvo.dal.tap.TAPResults'>

Convert the results to an astropy table and display it.

In [6]:
results = service.search(query).to_table()
results
Out[6]:
Table length=5
descriptionschema_indexschema_nameutype
str512int32str64str512
Data Preview 0.1 includes five tables based on the DESC's Data Challenge 2 simulation of 300 square degrees of the wide-fast-deep LSST survey region after 5 years. All tables contain objects detected in coadded images.1dp01_dc2_catalogs
Data Preview 0.2 contains the image and catalog products of the Rubin Science Pipelines v23 processing of the DESC Data Challenge 2 simulation, which covered 300 square degrees of the wide-fast-deep LSST survey region over 5 years.0dp02_dc2_catalogs
ObsCore v1.1 attributes in ObsTAP realization2ivoa
A TAP-standard-mandated schema to describe tablesets in a TAP 1.1 service100000tap_schema
UWS Metadata120000uws

2.3. The DP0.2 catalogs¶

All the DP0 tables (catalogs) are in the "dp02_dc2_catalogs" schema (table collection).

Search for the DP0 schema name and store as a variable.

In [7]:
schema_names = results['schema_name']
for name in schema_names:
    if name.find('dp02') > -1:
        dp02_schema_name = name
        break
print("DP0.2 schema is " + dp02_schema_name)
DP0.2 schema is dp02_dc2_catalogs

Explore tables in the DP0.2 schema, ordering them by their database. This is the order in which they will appear presented to the user in the RSP Portal. Display the results to see the tables in the DP0.2 schema, which are the same tables that are presented via the Portal GUI, together with a description of each.

In [8]:
query = "SELECT * FROM tap_schema.tables " \
        "WHERE tap_schema.tables.schema_name = '" \
        + dp02_schema_name + "' order by table_index ASC"
print(query)

results = service.search(query)
results = results.to_table()
results
SELECT * FROM tap_schema.tables WHERE tap_schema.tables.schema_name = 'dp02_dc2_catalogs' order by table_index ASC
Out[8]:
Table length=11
descriptionschema_nametable_indextable_nametable_typeutype
str512str512int32str64str8str512
Properties of the astronomical objects detected and measured on the deep coadded images.dp02_dc2_catalogs1dp02_dc2_catalogs.Objecttable
Properties of detections on the single-epoch visit images, performed independently of the Object detections on coadded images.dp02_dc2_catalogs2dp02_dc2_catalogs.Sourcetable
Forced-photometry measurements on individual single-epoch visit images and difference images, based on and linked to the entries in the Object table. Point-source PSF photometry is performed, based on coordinates from a reference band chosen for each Object and reported in the Object.refBand column.dp02_dc2_catalogs3dp02_dc2_catalogs.ForcedSourcetable
Properties of time-varying astronomical objects based on association of data from one or more spatially-related DiaSource detections on individual single-epoch difference images.dp02_dc2_catalogs4dp02_dc2_catalogs.DiaObjecttable
Properties of transient-object detections on the single-epoch difference images.dp02_dc2_catalogs5dp02_dc2_catalogs.DiaSourcetable
Point-source forced-photometry measurements on individual single-epoch visit images and difference images, based on and linked to the entries in the DiaObject table.dp02_dc2_catalogs6dp02_dc2_catalogs.ForcedSourceOnDiaObjecttable
Metadata about the pointings of the DC2 simulated survey, largely associated with the boresight of the entire focal plane.dp02_dc2_catalogs7dp02_dc2_catalogs.Visittable
Metadata about the 189 individual CCD images for each Visit in the DC2 simulated survey.dp02_dc2_catalogs8dp02_dc2_catalogs.CcdVisittable
Static information about the subset of tracts and patches from the standard LSST skymap that apply to coadds in these catalogsdp02_dc2_catalogs9dp02_dc2_catalogs.CoaddPatchestable
Summary properties of objects from the DESC DC2 truth catalog, as described in arXiv:2101.04855. Includes the noiseless astrometric and photometric parameters.dp02_dc2_catalogs10dp02_dc2_catalogs.TruthSummarytable
Match information for TruthSummary objects.dp02_dc2_catalogs11dp02_dc2_catalogs.MatchesTruthtable

Useful terms related to TAP:
  • schema - database terminology for the abstract design that represents the storage of data in a database.
  • tap_schema - the specific schema describing the TAP service. All TAP services must support a set of tables in a schema named TAP_SCHEMA that describe the tables and columns included in the service.
  • table collection - a collection of tables. e.g., dp02_dc2_catalogs.
  • table - a collection of related data held in a table format in a database, e.g., the object table (dp02_dc2_catalogs.Object).
  • results - the query result set. The TAP service returns data from a query as a TAPResults object. Find more about TAPResults here.

3. Query the DP0.2 Object catalog¶

The Object catalog (dp02_dc2_catalogs.Object) contains sources detected in the coadded images (also called stacked or combined images).

3.1. Get the columns available for a given table¶

Request the column names, data types, descriptions, and units for all columns in the Object catalog, and display as a Pandas table (which will automatically truncate).

Notice that there are 991 columns available in the Object table.

In [9]:
results = service.search("SELECT column_name, datatype, description, unit from TAP_SCHEMA.columns "
                         "WHERE table_name = 'dp02_dc2_catalogs.Object'")
results.to_table().to_pandas()
Out[9]:
column_name datatype description unit
0 coord_dec double Fiducial ICRS Declination of centroid used for... deg
1 coord_ra double Fiducial ICRS Right Ascension of centroid used... deg
2 deblend_nChild int Number of children this object has (defaults t...
3 deblend_skipped boolean Deblender skipped this source
4 detect_fromBlend boolean This source is deblended from a parent with mo...
... ... ... ... ...
986 z_psfFlux_flag_apCorr boolean Set if unable to aperture correct base_PsfFlux...
987 z_psfFlux_flag_edge boolean Object was too close to the edge of the image ...
988 z_psfFlux_flag_noGoodPixels boolean Not enough non-rejected pixels in data to atte...
989 z_psfFluxErr double Flux uncertainty derived from linear least-squ... nJy
990 z_ra double Position in Right Ascension. Measured on z-band. deg

991 rows × 4 columns

There is no need to read through all the columns, which are also available in the "Data Products Definitions" section of the DP0.2 documentation.

Search all column names for a specific term in order to find columns of interest. For example, the next cell loops over all column names and, if a user-defined search string such as "coord" is found, prints the column name.

In [10]:
search_string = 'coord'
for cname in results['column_name']:
    if cname.find(search_string) > -1:
        print(cname)
coord_dec
coord_ra

In the cell above, change the search string and execute the cell to find other columns of interest, e.g., try "Flux", "flux", "blend", "z", etc. One of those options yields no matches.

Clean up.

In [11]:
del results

3.2. Cone search specifying the maximum number of records¶

RA, Dec constraints yield faster queries: The TAP-accessible tables are sharded by coordinate (RA, Dec). ADQL query statements that include constraints by coordinate do not require a whole-catalog search, and are typically faster (and can be much faster) than ADQL query statements which only include constraints for other columns.

A cone search on the Object table will be a common TAP query. In this example, a circle centered on (RA, Dec) = (62.0, -37.0), with a radius of 1 degree is used.

Define the central coordinates and search radius using Astropy SkyCoord and units.

In [12]:
center_coords = SkyCoord(62, -37, frame='icrs', unit='deg')
search_radius = 1.0*u.deg

print(center_coords)
print(search_radius)
<SkyCoord (ICRS): (ra, dec) in deg
    (62., -37.)>
1.0 deg

The TAP queries take the center coordinates and the search radius -- both in units of degrees -- as strings, so also define strings to use in the query statements below.

In [13]:
use_center_coords = "62, -37"
use_radius = "1.0"

For debugging and testing queries, it is often useful to only return a few records for expediency. This can be done in one of two ways, setting the TOP field in a query, or setting the maxrec parameter in the TAP service query. The two methods produce identical results, as demonstrated below.

Define the maximum records to return.

In [14]:
max_rec = 5

3.2.1. Set TOP¶

Use the "TOP" option to set the maximum number of records.

Build a query to find bright objects with magnitude < 18 and an extendedness = 0 in the r-band filter. Recall that it is always recommended to set detect_isPrimary = True (which means the source has no deblended children, to avoid returning both deblended and blended objects).

Order the results by descending flux, both in this example and the next -- otherwise, results are returned in random order and we would not be able to confirm that both methods for setting the number of records are equivalent.

Warning: Combining use of TOP with ORDER BY in ADQL queries can be dangerous, as in, may take an unexpectedly long time because the database is trying to first sort, and then extract the top N elements. It is best to only combine TOP and ORDER BY if your query cuts down the number of objects that would need to be sorted, first. (Which is the case in the examples below).

Execute the query, and confirm that only 5 records were retrieved.

In [15]:
query = "SELECT TOP " + str(max_rec) + " " + \
        "objectId, coord_ra, coord_dec, detect_isPrimary, " + \
        "g_cModelFlux, r_cModelFlux, r_extendedness, r_inputCount " + \
        "FROM dp02_dc2_catalogs.Object " + \
        "WHERE CONTAINS(POINT('ICRS', coord_ra, coord_dec), " + \
        "CIRCLE('ICRS', " + use_center_coords + ", " + use_radius + ")) = 1 " + \
        "AND detect_isPrimary = 1 " + \
        "AND r_extendedness = 0 " + \
        "AND scisql_nanojanskyToAbMag(r_cModelFlux) < 18.0 " + \
        "ORDER by r_cModelFlux DESC"
results = service.search(query)
assert len(results) == max_rec

3.2.2. Set maxrec¶

Execute the same query using the maxrec parameter instead of TOP, name the output "results1" instead of "results", and confirm that only 5 records were retrieved.

Use of maxrec is not preferred, compared to SELECT TOP. Most TAP services map maxrec to SELECT TOP on the back end anyways, and using SELECT TOP ensures that queries are copy-pasteable between interfaces (e.g., into the Portal Aspect).

Warning: Use of maxrec might return a DALOverflowWarning to warn the user that partial results have been returned. In this case, it is ok to ignore this warning, because we intended to return partial results by using maxrec.

In [16]:
query = "SELECT objectId, coord_ra, coord_dec, detect_isPrimary, " + \
        "g_cModelFlux, r_cModelFlux, r_extendedness, r_inputCount " + \
        "FROM dp02_dc2_catalogs.Object " + \
        "WHERE CONTAINS(POINT('ICRS', coord_ra, coord_dec), " + \
        "CIRCLE('ICRS', " + use_center_coords + ", " + use_radius + ")) = 1 " + \
        "AND detect_isPrimary = 1 " + \
        "AND r_extendedness = 0 " + \
        "AND scisql_nanojanskyToAbMag(r_cModelFlux) < 18.0 " + \
        "ORDER by r_cModelFlux DESC"
results1 = service.search(query, maxrec=max_rec)
assert len(results1) == max_rec
/opt/lsst/software/stack/conda/miniconda3-py38_4.9.2/envs/lsst-scipipe-8.0.0/lib/python3.11/site-packages/pyvo/dal/query.py:325: DALOverflowWarning: Partial result set. Potential causes MAXREC, async storage space, etc.
  warn("Partial result set. Potential causes MAXREC, async storage space, etc.",

3.2.3. Show the results from TOP and maxrec are identical¶

Convert the results to pandas data frames, and assert that the contents of the two tables are identical. There will be no output if they are identical.

In [17]:
assert_frame_equal(sort_dataframe(results.to_table().to_pandas()),
                   sort_dataframe(results1.to_table().to_pandas()))

For more detailed analysis of results, converting to a pandas dataframe is often very useful

In [18]:
results_table = results.to_table().to_pandas()
results_table
Out[18]:
objectId coord_ra coord_dec detect_isPrimary g_cModelFlux r_cModelFlux r_extendedness r_inputCount
0 1651422484454489150 61.553607 -36.340161 True 756266.786927 1.027650e+06 0.0 105
1 1651220174314974952 62.165179 -37.042481 True 667941.013732 1.025565e+06 0.0 104
2 1567965153859771386 61.757923 -37.259021 True 791906.212012 1.025209e+06 0.0 109
3 1650885922780129931 61.203036 -36.630631 True 786041.397310 1.017006e+06 0.0 106
4 1567956357766750525 62.156878 -37.250385 True 674630.755684 1.000605e+06 0.0 97
In [19]:
results1_table = results1.to_table().to_pandas()
results1_table
Out[19]:
objectId coord_ra coord_dec detect_isPrimary g_cModelFlux r_cModelFlux r_extendedness r_inputCount
0 1651422484454489150 61.553607 -36.340161 True 756266.786927 1.027650e+06 0.0 105
1 1651220174314974952 62.165179 -37.042481 True 667941.013732 1.025565e+06 0.0 104
2 1567965153859771386 61.757923 -37.259021 True 791906.212012 1.025209e+06 0.0 109
3 1650885922780129931 61.203036 -36.630631 True 786041.397310 1.017006e+06 0.0 106
4 1567956357766750525 62.156878 -37.250385 True 674630.755684 1.000605e+06 0.0 97

Clean up.

In [20]:
del results, results1, results_table, results1_table

3.3. Cone search with double table join¶

Create a similar query as above, for bright, non-extended Objects within a 1 degree radius, but now join with the ForcedSource table to obtain PSF photometry from the r-band processed visit images (PVIs), and join with the CcdVisit table to obtain the visits' Modified Julian Dates (MJDs).

Do not set a maximum number of records to return, because the goal is to generate a table containing the ForcedSource photometry for all of the Objects of interest. There is no need to use ORDER BY for this example, but it could be.

Notice: When restricting a JOIN query to just a portion of the sky (e.g., a cone search), restricting on coordinates in Object, DiaObject, or Source tables where possible can result in significantly improved performance.

In [21]:
query = "SELECT fs.forcedSourceId, fs.objectId, fs.ccdVisitId, fs.detect_isPrimary, " + \
        "fs.band, scisql_nanojanskyToAbMag(fs.psfFlux) as psfMag, ccd.obsStartMJD, " + \
        "scisql_nanojanskyToAbMag(obj.r_psfFlux) as obj_rpsfMag " + \
        "FROM dp02_dc2_catalogs.ForcedSource as fs " + \
        "JOIN dp02_dc2_catalogs.CcdVisit as ccd " + \
        "ON fs.ccdVisitId = ccd.ccdVisitId " + \
        "JOIN dp02_dc2_catalogs.Object as obj " + \
        "ON fs.objectId = obj.objectId " + \
        "WHERE CONTAINS(POINT('ICRS', obj.coord_ra, obj.coord_dec), " + \
        "CIRCLE('ICRS', " + use_center_coords + ", " + use_radius + ")) = 1 " + \
        "AND obj.detect_isPrimary = 1 " + \
        "AND obj.r_extendedness = 0 " + \
        "AND scisql_nanojanskyToAbMag(obj.r_cModelFlux) > 17.5 " + \
        "AND scisql_nanojanskyToAbMag(obj.r_cModelFlux) < 18.0 " + \
        "AND fs.band = 'r' "
results = service.search(query)

Take a look at how the output results are formatted from joined tables. Notice the obj_rpsfMag column has the same value for all rows with the same objectId. The table join from ForcedSource to Object is a many-to-one match.

In [22]:
results.to_table().to_pandas()
Out[22]:
forcedSourceId objectId ccdVisitId detect_isPrimary band psfMag obsStartMJD obj_rpsfMag
0 372758097422221495 1568360978045760071 694316062 True r 17.823767 60605.347236 17.839353
1 639286041945082666 1568360978045760071 1190763045 True r 17.835010 61325.198018 17.839353
2 351882936624121226 1568360978045760071 655433045 True r 17.826146 60550.281806 17.839353
3 234363525001150188 1568360978045760071 436536083 True r 17.828869 60206.283240 17.839353
4 490549749842213772 1568360978045760071 913720112 True r 17.829732 60910.316507 17.839353
... ... ... ... ... ... ... ... ...
41746 104088612949216950 1651272950873104986 193880150 True r 17.625196 59840.262845 17.633740
41747 565265039041250112 1651272950873104986 1052888183 True r 17.619891 61124.022116 17.633740
41748 565264464589375126 1651272950873104986 1052887113 True r 17.616255 61124.021671 17.633740
41749 563358012895409334 1651272950873104986 1049336070 True r 17.622543 61120.027661 17.633740
41750 634936914574590978 1651272950873104986 1182662164 True r 17.621899 61315.373327 17.633740

41751 rows × 8 columns

Investigate how many unique Objects were returned, and the distribution of the number of ForcedSources (visit images) per Object.

In [23]:
unique_objectIds, counts = np.unique(results['objectId'], return_counts=True)
print(len(unique_objectIds), ' unique objects returned')

fig = plt.figure(figsize=(4, 2))
plt.hist(counts, bins=20)
plt.xlabel('Number of ForcedSources')
plt.ylabel('Number of Objects')
plt.show()
383  unique objects returned
Image

Plot the time series of ForcedSource r-band photometry for one of the Objects. For this example, just take the first unique objectId from the list.

In [24]:
use_objectId = unique_objectIds[0]

tx = np.where(results['objectId'] == use_objectId)[0]

fig = plt.figure(figsize=(6, 4))
plt.axhline(results['obj_rpsfMag'][tx[0]], ls='solid', color='black',
            label='Object r-band PSF Magnitude (from deepCoadd)')
plt.plot(results['obsStartMJD'][tx], results['psfMag'][tx],
         'o', ms=10, mew=0, alpha=0.5, color='firebrick',
         label='ForcedSource r-band PSF Magnitude (from PVIs)')
plt.xlabel('MJD')
plt.ylabel('r-band magnitude')
plt.title('objectId = ' + str(use_objectId))
plt.legend(loc='lower left')
plt.show()
Image

Whether or not that randomly chosen Object is truly variable is left as an exercise for the learner.

Clean up.

In [25]:
del results, use_objectId, unique_objectIds, counts

Here is an example of an alternative query string for obtaining the same data in the case where the objectId is already known -- meaning, a case in which the Object catalog does not need to be queried via any columns aside from objectId.

SELECT fs.forcedSourceId, fs.objectId, fs.ccdVisitId, fs.detect_isPrimary, 
fs.band, scisql_nanojanskyToAbMag(fs.psfFlux) as psfMag, ccd.obsStartMJD, 
scisql_nanojanskyToAbMag(obj.r_psfFlux) as obj_rpsfMag 
FROM dp02_dc2_catalogs.ForcedSource as fs 
JOIN dp02_dc2_catalogs.CcdVisit as ccd ON fs.ccdVisitId = ccd.ccdVisitId 
JOIN dp02_dc2_catalogs.Object as obj ON fs.objectId = obj.objectId 
WHERE fs.objectId = 1567762843720227098 AND fs.band = 'r'

4. Visualize large data sets resulting from a query¶

In this section, the bokeh package is used to create interactive plots to explore a dataset. Multiple panels are used to show different representations of the same dataset. A selection applied to either panel will highlight the selected points in the other panel.

Bokeh Documentation
Holoviews Documentation

4.1. Data query¶

Select Objects within a central search area, with r-band magnitudes between 17 and 23 mag, and with a measured r-band extendedness parameter (that is not NaN or NULL).

Return the coordinates, gri magnitudes, and r-band extendedness.

Notice: The results are being converted to a Pandas table in the same line as the search is executed.

In [26]:
query = "SELECT objectId, detect_isPrimary, " + \
        "coord_ra AS ra, coord_dec AS dec, " + \
        "scisql_nanojanskyToAbMag(g_cModelFlux) AS mag_g_cModel, " + \
        "scisql_nanojanskyToAbMag(r_cModelFlux) AS mag_r_cModel, " + \
        "scisql_nanojanskyToAbMag(i_cModelFlux) AS mag_i_cModel, " + \
        "r_extendedness " + \
        "FROM dp02_dc2_catalogs.Object " + \
        "WHERE CONTAINS(POINT('ICRS', coord_ra, coord_dec), " + \
        "CIRCLE('ICRS', " + use_center_coords + ", " + use_radius + ")) = 1 " + \
        "AND detect_isPrimary = 1 " + \
        "AND scisql_nanojanskyToAbMag(g_cModelFlux) > 17.0 " + \
        "AND scisql_nanojanskyToAbMag(g_cModelFlux) < 23.0 " + \
        "AND scisql_nanojanskyToAbMag(r_cModelFlux) > 17.0 " + \
        "AND scisql_nanojanskyToAbMag(r_cModelFlux) < 23.0 " + \
        "AND scisql_nanojanskyToAbMag(i_cModelFlux) > 17.0 " + \
        "AND scisql_nanojanskyToAbMag(i_cModelFlux) < 23.0 " + \
        "AND r_extendedness IS NOT NULL "
results = service.search(query).to_table().to_pandas()

Display the Pandas table.

In [27]:
results
Out[27]:
objectId detect_isPrimary ra dec mag_g_cModel mag_r_cModel mag_i_cModel r_extendedness
0 1567798028092359809 True 61.290253 -37.823108 21.925642 21.900913 21.595578 1.0
1 1567929969487672063 True 61.113473 -37.588913 21.646285 21.416171 21.353752 1.0
2 1567929969487672098 True 61.124363 -37.589575 22.429075 22.332013 21.790280 1.0
3 1567929969487670941 True 61.003225 -37.599231 22.341257 20.771771 20.172821 1.0
4 1567929969487671127 True 61.082493 -37.597920 22.318689 21.846055 21.346547 1.0
... ... ... ... ... ... ... ... ...
45860 1651272950873090766 True 62.426652 -37.000392 22.632068 22.324121 21.602873 1.0
45861 1651272950873090754 True 62.382478 -37.000991 22.202981 20.767600 20.154292 1.0
45862 1651272950873104729 True 62.302607 -36.867658 22.780018 22.379061 21.688285 1.0
45863 1651272950873105956 True 62.281460 -36.853744 22.661237 22.340502 21.670607 1.0
45864 1651272950873105930 True 62.333096 -36.851390 22.328228 21.295011 20.767673 1.0

45865 rows × 8 columns

4.2. Data preparation¶

The basis for any data visualization is the underlying data.

The ColumnDataSource (CDS) is the core of bokeh plots. A CDS can be prepared from the data returned by a query, enabling it to be passed directly to bokeh. Bokeh automatically creates a CDS from data passed as python lists or numpy arrays. CDS are useful as they allow data to be shared between multiple plots and renderers, enabling brushing and linking. A CDS is essentially a collection of sequences of data that have their own unique column name.

Getting the data preparation phase right is the key to creating powerful visualizations.

Recall the center coordinates that were defined with astropy SkyCoord. Get the central RA and Dec as values (datatype float) and print them.

In [28]:
center_ra = center_coords.ra.deg
center_dec = center_coords.dec.deg
print(center_ra, center_dec)
62.0 -37.0

Create a python dictionary that associates the r_extendedness column with the Object being a star or a galaxy.

In [29]:
object_map = {0.0: 'star', 1.0: 'galaxy'}

Create a python dictionary to store the data from the query and pass to the ColumnDataSource. All columns in a CDS must have the same length.

In [30]:
data = dict(ra=results['ra'], dec=results['dec'],
            target_ra=results['ra']-center_ra,
            target_dec=results['dec']-center_dec,
            gmi=results['mag_g_cModel']-results['mag_i_cModel'],
            gmag=results['mag_g_cModel'],
            rmag=results['mag_r_cModel'],
            imag=results['mag_i_cModel'])
source = ColumnDataSource(data=data)

Additional data can be added to the CDS after creation, as demonstrated in the cell below with the addition of objectId and extendedness in the r-band.

Note that the objectId needs to be converted to a string, because bokeh can't interpret an integer value that high.

In [31]:
objIdString = np.array(results['objectId']).astype('str')
source.data['objectId'] = objIdString
source.data['r_extendedness'] = results['r_extendedness']

Use the object_map dictionary to add an object_type column to the CDS.

In [32]:
source.data['object_type'] = results['r_extendedness'].map(object_map)
source.data['object_type']
Out[32]:
0        galaxy
1        galaxy
2        galaxy
3        galaxy
4        galaxy
          ...  
45860    galaxy
45861    galaxy
45862    galaxy
45863    galaxy
45864    galaxy
Name: r_extendedness, Length: 45865, dtype: object

4.3. Plot a color-magnitude diagram¶

Use bokeh to plot a color-magnitude (g vs. g-i) diagram making use of the cModel magnitudes. Hover over the points in the plot to see their values.

Define the plot aesthetics and tools.

In [33]:
plot_options = {'height': 400, 'width': 400,
                'tools': ['box_select', 'reset', 'box_zoom', 'help']}

Define the hover tool.

In [34]:
tooltips = [
    ("Col (g-i)", "@gmi"),
    ("Mag (g)", "@gmag"),
    ("Mag (r)", "@rmag"),
    ("Mag (i)", "@imag"),
    ("Type", "@object_type")
]
hover_tool_cmd = HoverTool(tooltips=tooltips)

Create a color-magnitude diagram.

In [35]:
p = figure(title="Color - Magnitude Diagram",
           x_axis_label='g-i', y_axis_label='g',
           x_range=(-1.8, 4.3), y_range=(23.5, 16),
           **plot_options)

Color-code the different object types by defining a color map palette.

In [36]:
object_type_palette = ['darkred', 'lightgrey']
p.add_tools(hover_tool_cmd)
p.circle(x='gmi', y='gmag', source=source,
         size=3, alpha=0.6,
         legend_field="object_type",
         color=factor_cmap('object_type',
                           palette=object_type_palette,
                           factors=['star', 'galaxy']),
         hover_color="darkblue")
Out[36]:
GlyphRenderer(
id = 'p1052', …)
coordinates = None,
data_source = ColumnDataSource(id='p1002', ...),
glyph = Circle(id='p1048', ...),
group = None,
hover_glyph = Circle(id='p1050', ...),
js_event_callbacks = {},
js_property_callbacks = {},
level = 'glyph',
muted = False,
muted_glyph = Circle(id='p1051', ...),
name = None,
nonselection_glyph = Circle(id='p1049', ...),
propagate_hover = False,
selection_glyph = 'auto',
subscribed_events = PropertyValueSet(),
syncable = True,
tags = [],
view = CDSView(id='p1053', ...),
visible = True,
x_range_name = 'default',
y_range_name = 'default')

Show the interactive plot.

In [37]:
show(p)

4.4. Plot a color-color (r-i vs. g-r) diagram¶

Add a color-color (r-i vs. g-r) diagram and make use of the advanced linking features of bokeh to enable brushing and linking between the the color-magnitude diagram and this color-color plot.

In [38]:
source.data['rmi'] = results['mag_r_cModel'] - results['mag_i_cModel']
source.data['gmr'] = results['mag_g_cModel'] - results['mag_r_cModel']

The CMD above is very crowded. Below, filter on object type and plot stars only.

Use a GroupFilter to select rows from the CDS that satisfy object_type = "star".

In [39]:
stars = CDSView()
stars.filter = GroupFilter(column_name='object_type', group="star")

Define the various options for the plot, and create the hover tool.

In [40]:
plot_options = {'height': 350, 'width': 350,
                'tools': ['box_zoom', 'box_select',
                          'lasso_select', 'reset', 'help']}

hover_tool = HoverTool(tooltips=[("(RA,DEC)", "(@ra, @dec)"),
                                 ("(g-r,g)", "(@gmr, @gmag)"),
                                 ("objectId", "@objectId"),
                                 ("type", "@object_type")])

Create the three plots: spatial, color-magnitude, and color-color, then plot all three together on a grid.

Create the spatial plot.

In [41]:
title_spatial = 'Spatial centred on (RA,DEC) = '+use_center_coords

fig_spatial = figure(title=title_spatial,
                     x_axis_label="Delta RA", y_axis_label="Delta DEC",
                     **plot_options)
fig_spatial.circle(x='target_ra', y='target_dec',
                   source=source, view=stars,
                   size=4.0, alpha=0.6,
                   color='teal', hover_color='firebrick')
fig_spatial.add_tools(hover_tool)

Create the color-magnitude plot.

In [42]:
fig_cmag = figure(title="Color-Magnitude Diagram",
                  x_axis_label="g-r", y_axis_label="g",
                  x_range=(-1.0, 2.0), y_range=(23.5, 16),
                  **plot_options)
fig_cmag.circle(x='gmr', y='gmag', source=source, view=stars,
                size=4.0, alpha=0.6,
                color='teal', hover_color='firebrick')
fig_cmag.add_tools(hover_tool)

Create the color-color plot.

In [43]:
fig_cc = figure(title="Color-Color Diagram",
                x_axis_label="g-r", y_axis_label="r-i",
                x_range=(-1.0, 2.0), y_range=(-1.0, 2.5),
                **plot_options)
fig_cc.circle(x='gmr', y='rmi', source=source, view=stars,
              size=4.0, alpha=0.6,
              color='teal', hover_color='firebrick')
fig_cc.add_tools(hover_tool)

Show all three plots together in a row.

In [44]:
p = gridplot([[fig_spatial, fig_cmag, fig_cc]])
show(p)

Hover the mouse over the data points in any of the plots in order to use the hover tool to see information about individual data points (e.g., the objectId). Notice the data points highlighted in red on one panel with the hover tool are also highlighted on the other panels.

Click on the box select or lasso select icons in the upper right corner of the figure. Use the box select or lasso select functionality to make a selection in any panel by clicking and dragging to define a region. The selected data points will be displayed in darker teal in all panels.

Use the refresh icon to reset the plot.

Clean up.

In [45]:
del results, data, p, stars, source

5. Asynchronous TAP queries¶

All of the queries above were synchronous queries. This means that the query will continue executing in the notebook until it is finished, and no other cells can be executed during that time. The Jupyter code cell will show the asterisk to the left of the cell until the query completes and the results are returned. The asterisk will then become a number. Synchronous queries are good for short queries that take seconds to minutes.

For longer queries, or for running multiple queries at the same time, an asynchronous query is more suitable. With asynchronous queries, other Jupyter cells can be executed while the query is running, and results can be retrieved later. This is especially important for queries that are long or return a lot of results, and it also safeguards long queries against network outages or timeouts.

The results of a query will be the same whether it is run synchronously or asynchronously.

5.1. Define the query¶

Create a table join cone search query on the ForcedSource and CcdVisit tables, but with fewer conditions and a smaller radius than above.

In [46]:
use_radius = "0.1"

query = "SELECT fs.forcedSourceId, fs.objectId, fs.ccdVisitId, " + \
        "fs.detect_isPrimary, fs.band, " + \
        "scisql_nanojanskyToAbMag(fs.psfFlux) as psfMag, ccd.obsStartMJD " + \
        "FROM dp02_dc2_catalogs.ForcedSource as fs " + \
        "JOIN dp02_dc2_catalogs.CcdVisit as ccd " + \
        "ON fs.ccdVisitId = ccd.ccdVisitId " + \
        "WHERE CONTAINS(POINT('ICRS', fs.coord_ra, fs.coord_dec), " + \
        "CIRCLE('ICRS', " + use_center_coords + ", " + use_radius + ")) = 1 " + \
        "AND scisql_nanojanskyToAbMag(fs.psfFlux) > 17.5 " + \
        "AND scisql_nanojanskyToAbMag(fs.psfFlux) < 18.0 " + \
        "AND fs.band = 'r' "

print(query)
SELECT fs.forcedSourceId, fs.objectId, fs.ccdVisitId, fs.detect_isPrimary, fs.band, scisql_nanojanskyToAbMag(fs.psfFlux) as psfMag, ccd.obsStartMJD FROM dp02_dc2_catalogs.ForcedSource as fs JOIN dp02_dc2_catalogs.CcdVisit as ccd ON fs.ccdVisitId = ccd.ccdVisitId WHERE CONTAINS(POINT('ICRS', fs.coord_ra, fs.coord_dec), CIRCLE('ICRS', 62, -37, 0.1)) = 1 AND scisql_nanojanskyToAbMag(fs.psfFlux) > 17.5 AND scisql_nanojanskyToAbMag(fs.psfFlux) < 18.0 AND fs.band = 'r' 

5.2. Synchronous query¶

Run the synchronous query and wait for the results. Notice that the query takes one to two minutes.

In [47]:
results = service.search(query).to_table().to_pandas()
In [48]:
len_sync_results = len(results)
print(len_sync_results)
1343

5.3. Asynchronous query¶

Create and submit the job. This step does not run the query yet.

Then get the job URL and the job phase. It will be "pending" until the job is started.

In [49]:
job = service.submit_job(query)

print('Job URL is', job.url)

print('Job phase is', job.phase)
Job URL is https://data.lsst.cloud/api/tap/async/uyu0jricbz1p9qj4
Job phase is PENDING

Run the job, and see that the cell completes executing, even though the query is still running. Notice how waiting one to two minutes until another cell can be executed is not necessary with the asynchronous query, as it was with the synchronous version above.

In [50]:
job.run()
Out[50]:
<pyvo.dal.tap.AsyncTAPJob at 0x7fb14fd07f10>

Option: Use the following cell to tell python to wait for the job to finish. Only do this if there is no neeed to run anything else while waiting. The cell will continue executing until the job is finished (i.e., either the status COMPLETED or ERROR is returned). An asterisk will appear to the left of the cell until the job is finished, just like with a synchronous query.

In [51]:
job.wait(phases=['COMPLETED', 'ERROR'])
print('Job phase is', job.phase)
Job phase is COMPLETED

A useful function to raise an exception if there was a problem with the query.

In [52]:
job.raise_if_error()

Once the job completes successfully, fetch the results.

In [53]:
async_results = job.fetch_result().to_table().to_pandas()

Assert that the length of the results of the asynchronous and synchronous executions of the query statement are the same.

In [54]:
assert len(async_results) == len_sync_results

Assert that the actual results are the same as obtained from executing synchronous queries.

Above, the sort_dataframe function was used with the default sort key, objectId, but this time the objectId is not unique in the results dataframes and so the forcedSourceId must be used.

In [55]:
assert_frame_equal(sort_dataframe(results, sort_key='forcedSourceId'),
                   sort_dataframe(async_results, sort_key='forcedSourceId'))

5.4. Retrieve the results from a previous asynchronous job¶

Job results are generally available from previously run queries, and can be retrieved if the URL of the job is known.

The code cell below gets the URL for the query above and prints it, then retrieves the results of the job and again asserts that the length is the same as the synchronous query results, and that the two dataframes are equal.

In [56]:
print(job.url)
retrieved_job = retrieve_query(job.url)

retrieved_results = retrieved_job.fetch_result().to_table().to_pandas()

assert len(retrieved_results) == len_sync_results

assert_frame_equal(sort_dataframe(retrieved_results, sort_key='forcedSourceId'),
                   sort_dataframe(async_results, sort_key='forcedSourceId'))
https://data.lsst.cloud/api/tap/async/uyu0jricbz1p9qj4

5.5. Delete the job¶

Once the job is finished and the results have been retrieved and analyzed (or if analysis revealed a mistake) delete the job and the results from the server.

Note that results will be deleted automatically if the notebook kernel is restarted.

In [57]:
job.delete()

Clean up.

In [58]:
del results, async_results, retrieved_results

6. Retrieve the results of a Portal query¶

It is also possible to retrieve the results of queries executed in the Portal Aspect of the Rubin Science Platform.

In a new browser tab, go to data.lsst.cloud and enter the Portal Aspect.

Click "Edit ADQL" at upper right, and copy-paste the following query into the ADQL box as shown in the screenshot below.

This query selects bright point sources within a 1 degree radius near the center of the DC2 region.

SELECT coord_dec, coord_ra, g_calibFlux, i_calibFlux, r_calibFlux 
FROM dp02_dc2_catalogs.Object 
WHERE CONTAINS (POINT('ICRS', coord_ra, coord_dec), CIRCLE('ICRS', 62.0, -37.0, 1)) = 1 
AND detect_isPrimary =1 
AND g_calibFlux >360 AND g_extendedness =0 
AND i_calibFlux >360 AND i_extendedness =0 
AND r_calibFlux >360 AND r_extendedness =0 

Click "Search".

Image

The default results view will appear as in the screenshot below.

Image

The search results have been automatically saved and assigned a URL.

Retrieve the URL by clicking the "Info" button (the letter i in a circle) to the upper-right of the table. The pop-up window contains the URL (the Job Link).

Image

Warning: Using the job link above will not work, it has expired.

Follow the instructions above to execute the query and obtain the URL.

Copy the URL into the empty quotes below to define my_portal_url.

Uncomment and execute the following cell to obtain the results of the Portal query here in the Notebook, as my_portal_results.

In [59]:
# my_portal_url = ''
# my_portal_job = retrieve_query(my_portal_url)
# my_portal_results = my_portal_job.fetch_result().to_table().to_pandas()
# my_portal_results